For the SENIC data set, consider the model regressing loglength on xray, census and age.
Plot the studentized residuals, leverage and Cook’s D by id and identify observations that have unusual leverage, residual or Cook’s D. You can make three separate plots or produce a “bubble” plot.
Identify the two hospitals with the highest Cook’s D. By examining influence statistics such as leverage and residual, along with log(length) and the predictor values, can you discover which characteristics of these hospitals make them potentially influential (that is, give them a high Cook’s D)?
Rather than rely completely on influence measures, it is better to conduct sensitivity analyses to see whether our inferences really do depend on one or a few unusual observations. Run the model after dropping these observations. (We are doing this to see what happens. I do not recommend that you automatically drop influential observations! Rather, you need to do an investigation.) How do the results compare to the results when fitting the model to the full data set? Compare regression coefficient estimates, p-values and root MSE.
Conduct model diagnostics for the partial relationships in the model using component-plus-residual plots. Decide whether any of the predictors should be transformed to improve the model. Provide the output for your final model.
Answer:
a
The SAS code is as follows:
The leverage plot is as follows:
Observations 104 and 112 have high leverage, with values of 0.1525 and 0.1523, respectively.
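As a rough check on how large these leverages are, the sketch below compares them with the average leverage p/n and with the common 2p/n and 3p/n rules of thumb. The cut-offs are conventions, not something stated in the assignment, and n = 113 is an assumption taken from the R output later in this document (108 residual degrees of freedom plus 5 coefficients).

```python
# Assumptions (not stated in this part of the write-up):
#   n = 113 hospitals, inferred from the R output later in the document
#   p = 4 parameters: intercept + xray + census + age
n, p = 113, 4

mean_leverage = p / n                         # average of the hat-matrix diagonal
cutoffs = {"2p/n": 2 * p / n, "3p/n": 3 * p / n}

# Leverage values reported above for the two flagged hospitals
reported = {104: 0.1525, 112: 0.1523}

for hospital_id, h in reported.items():
    exceeded = [name for name, cut in cutoffs.items() if h > cut]
    ratio = h / mean_leverage
    print(f"ID {hospital_id}: h = {h:.4f}, {ratio:.1f}x the mean, exceeds {exceeded}")
```

Under these assumptions both hospitals sit above the stricter 3p/n cut-off, which is consistent with calling their leverage high.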
The studentized residual plot is as follows:
Observations with a studentized residual larger than 2 or smaller than -2 are considered unusual. Observation 47 has the highest studentized residual, 4.0883. Observations 26, 33, 43, 54, 103, and 112 are also unusual by this rule.
The Cook’s D plot is as follows:
Observations 47 and 112 appear to have high influence, with Cook’s D values of 0.2185 and 0.1733, respectively.
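The link among the three diagnostics can be made explicit. For a model with \(p\) estimated parameters (here \(p = 4\): the intercept plus xray, census and age), Cook’s D for observation \(i\) can be written in terms of its leverage \(h_{ii}\) and its internally studentized residual \(r_i\). The notation is standard but is not taken from the original output:

\[ D_i = \frac{r_i^{2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \]

An observation can therefore reach a large \(D_i\) through a large residual, a large leverage, or both. If the plotted residuals are externally studentized (which cannot be checked from this write-up), substituting them into the formula reproduces \(D_i\) only approximately.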
b
The summary statistics for log(length), xray, census and age are as follows:
Observation 47 has high influence because it has a large studentized residual. Its log(length) value is 2.9735, the maximum among all observations. The distribution of log(length) is right-skewed, and this observation is an outlier. In other words, the average length of stay (in days) at hospital 47 is much longer than at the other hospitals.
The distribution of log(length) is shown below:
The distribution of log(length), with the observation for ID = 47 identified, is shown below:
Observation 112 has high influence because it has high leverage. Among its three predictor values (xray, census and age), census is extremely high at 791, the maximum among all observations. The distribution of census is right-skewed, and this observation is an outlier. In other words, the average daily number of patients at hospital 112 during the study period is much larger than at the other hospitals.
The distribution of census is shown below:
The distribution of census, with the observation for ID = 112 identified, is shown below:
In brief, of the two hospitals with the highest Cook’s D, hospital 47 has an unusually long average length of stay and hospital 112 has an unusually large average daily census.
c
The SAS code is as follows:
The results for the full data set, including the two unusual observations, are as follows:
After dropping the two unusual observations (IDs 47 and 112), the results are as follows:
Coefficient of xray (with unusual observations): \(0.00330\), p-value \(<0.0001\).
Coefficient of xray (without unusual observations): \(0.00283\), p-value \(<0.0001\).
Coefficient of census (with unusual observations): \(0.00054446\), p-value \(<0.0001\).
Coefficient of census (without unusual observations): \(0.00045298\), p-value \(<0.0001\).
Coefficient of age (with unusual observations): \(0.00812\), p-value = \(0.0072\).
Coefficient of age (without unusual observations): \(0.00578\), p-value = \(0.0385\).
Root MSE (with unusual observations): \(0.13975\).
Root MSE (without unusual observations): \(0.12799\).
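All three coefficients keep the same sign and remain significant at the 0.05 level after the two observations are dropped, and root MSE decreases only slightly. The largest change is for age: its coefficient falls from 0.00812 to 0.00578 and its p-value rises from 0.0072 to 0.0385. So the overall inferences do not depend on these two unusual observations, although the evidence for the age effect is somewhat weaker without them.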
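To put numbers on this comparison, the short sketch below computes the percent change in each estimate and in root MSE when the two observations are dropped. The values are copied from the results listed above; the variable and function names are illustrative only.

```python
# Estimates copied from the two fits reported above
full = {"xray": 0.00330, "census": 0.00054446, "age": 0.00812, "root_mse": 0.13975}
reduced = {"xray": 0.00283, "census": 0.00045298, "age": 0.00578, "root_mse": 0.12799}

def percent_change(before, after):
    """Relative change from the full-data fit to the reduced-data fit, in percent."""
    return 100.0 * (after - before) / before

changes = {name: percent_change(full[name], reduced[name]) for name in full}

for name, pct in changes.items():
    print(f"{name:>8}: {pct:+.1f}%")
# Prints approximately: xray -14.2%, census -16.8%, age -28.8%, root_mse -8.4%
```

Every estimate shrinks toward zero, with age changing the most in relative terms.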
d
The SAS code is as follows:
The parameter estimates are as follows:
The component-plus-residual plot for age is as follows:
The component-plus-residual plot for xray is as follows:
The component-plus-residual plot for census is as follows:
It seems that age and xray do not need to be transformed, but census does. We need to move down the ladder of powers in x (census), so we use a log transformation for census.
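For reference, the quantity on the vertical axis of a component-plus-residual plot for predictor \(x_j\) is the partial residual. It is written here in standard notation; the symbols are not from the original output:

\[ e_i^{(j)} = e_i + \hat{\beta}_j x_{ij} \]

Here \(e_i\) is the ordinary residual from the fitted model and \(\hat{\beta}_j\) is the estimated coefficient of \(x_j\). Plotting \(e_i^{(j)}\) against \(x_{ij}\) shows the partial relationship. A pattern that rises and then levels off as \(x\) grows suggests moving down the ladder of powers in \(x\), for example to \(\log x\).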
The SAS code for the transformation is as follows:
The new component-plus-residual plot for census is as follows:
This plot looks better than before, so we fit the model with log(census) in place of census. The final model is as follows:
For the SENIC data set, use best subsets selection to select a best model. The outcome variable is risk.
The pool of candidate predictors is region, beds, services, medsch, xray, length and a new variable created as nurses/census (nurse/patient ratio).
Before model selection, log-transform positively skewed predictors.
Consider Cp, AIC and SBC/BIC to select among models.
To present your results, make a table listing the top models (about 5-8 models, using your discretion) and the values of their model selection criteria. Briefly describe the results and which model appears to be the “best”.
Answer:
After creating the new variable nurses/census, we check the distributions of all predictors. The distributions of beds, length and nurses/census are right-skewed, so we log-transform them.
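The effect of the log transformation on a right-skewed predictor can be sketched with a moment-based skewness measure. The values below are made-up toy numbers, not SENIC data, and the function name is illustrative only.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical bed counts with one very large hospital (toy data, not from SENIC)
toy_beds = [50, 80, 120, 200, 800]
toy_logbeds = [math.log(v) for v in toy_beds]

raw_skew = skewness(toy_beds)
log_skew = skewness(toy_logbeds)
print(f"skewness before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

For these toy values the skewness drops after taking logs but stays positive, which is the kind of improvement the transformation is meant to deliver.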
Call:
lm(formula = risk ~ logbeds + xray + loglength + lognuratio,
    data = senic_transformed)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7345 -0.6425 -0.0777  0.5684  2.8133

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.784931   1.179192  -4.906 3.31e-06 ***
logbeds      0.509710   0.142622   3.574 0.000527 ***
xray         0.016694   0.005339   3.127 0.002270 **
loglength    2.757251   0.649407   4.246 4.63e-05 ***
lognuratio   1.040605   0.276438   3.764 0.000272 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9673 on 108 degrees of freedom
Multiple R-squared:  0.4982,  Adjusted R-squared:  0.4796
F-statistic: 26.81 on 4 and 108 DF,  p-value: 1.874e-15
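The selection criteria named in the problem can each be computed from a candidate model's error sum of squares. The sketch below uses one common set of definitions; statistical software may add constants to AIC and SBC, which does not change the ranking of models fit to the same data. The function name and arguments are illustrative, and the sanity check treats the model in the output above as if it were the full model, which is an assumption made only for that check.

```python
import math

def selection_criteria(sse, n, p, mse_full):
    """Mallows' Cp, AIC and SBC for a candidate model.

    sse      : error sum of squares of the candidate model
    n        : number of observations
    p        : number of parameters in the candidate model, including the intercept
    mse_full : mean squared error of the model containing all candidate predictors
    """
    cp = sse / mse_full - (n - 2 * p)
    aic = n * math.log(sse / n) + 2 * p
    sbc = n * math.log(sse / n) + p * math.log(n)
    return {"Cp": cp, "AIC": aic, "SBC": sbc}

# Sanity check: if a model is treated as its own "full" model, Cp equals p.
# SSE is recovered from the residual standard error in the output above.
n, p = 113, 5
sse = 0.9673 ** 2 * (n - p)
result = selection_criteria(sse, n, p, mse_full=sse / (n - p))
print(result)
```

Smaller values are better for all three criteria. SBC penalizes each extra parameter by ln(n) rather than 2, so it tends to favor smaller models than AIC once n exceeds about 7.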
Here we use best subsets selection to select the best model. The best model is the one with the smallest Cp, AIC, and SBC; the top 8 models are arranged by Cp. The first row gives the predictors in the best model: log(beds), xray, log(length) and log(nurses/census). The best model is as follows: